-
Notifications
You must be signed in to change notification settings - Fork 15.4k
[AMDGPU] SelectionDAG support for vector type set 0 to multiple sgpr64 #128017
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@llvm/pr-subscribers-backend-amdgpu Author: Janek van Oirschot (JanekvO) ChangesZeros seem to materialize into a sequence of Patch is 205.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/128017.diff 40 Files Affected:
|
You can test this locally with the following command:git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 553da9634dc4bae215e6c850d2de3186d09f9da5 f7e6f7b7f3bd16943a7874a717dbec2e04b5156f llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp llvm/test/CodeGen/AMDGPU/adjust-writemask-cse.ll llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll llvm/test/CodeGen/AMDGPU/bitreverse-inline-immediates.ll llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll llvm/test/CodeGen/AMDGPU/bug-cselect-b64.ll llvm/test/CodeGen/AMDGPU/cluster_stores.ll llvm/test/CodeGen/AMDGPU/collapse-endcf.ll llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll llvm/test/CodeGen/AMDGPU/fcanonicalize.ll llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-nondeterminism.ll llvm/test/CodeGen/AMDGPU/flat-scratch.ll llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll llvm/test/CodeGen/AMDGPU/hazard-recognizer-src-shared-base.ll llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll llvm/test/CodeGen/AMDGPU/imm.ll llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll llvm/test/CodeGen/AMDGPU/issue92561-restore-undef-scc-verifier-error.ll llvm/test/CodeGen/AMDGPU/issue98474-need-live-out-undef-subregister-def.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.ll llvm/test/CodeGen/AMDGPU/masked-load-vectortypes.ll llvm/test/CodeGen/AMDGPU/max-hard-clause-length.ll llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll llvm/test/CodeGen/AMDGPU/mfma-loop.ll llvm/test/CodeGen/AMDGPU/module-lds-false-sharing.ll llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll llvm/test/CodeGen/AMDGPU/scalar-float-sop2.ll llvm/test/CodeGen/AMDGPU/scheduler-rp-calc-one-successor-two-predecessors-bug.ll llvm/test/CodeGen/AMDGPU/set_kill_i1_for_floation_point_comparison.ll llvm/test/CodeGen/AMDGPU/si-optimize-vgpr-live-range-dbg-instr.ll llvm/test/CodeGen/AMDGPU/si-scheduler-exports.ll llvm/test/CodeGen/AMDGPU/smfmac_no_agprs.ll llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll llvm/test/CodeGen/AMDGPU/vopc_dpp.ll llvm/test/CodeGen/AMDGPU/waterfall_kills_scc.ll llvm/test/CodeGen/AMDGPU/wqm.llThe following files introduce new uses of undef:
Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields In tests, avoid using For example, this is considered a bad practice: define void @fn() {
...
br i1 undef, ...
}Please use the following instead: define void @fn(i1 %cond) {
...
br i1 %cond, ...
}Please refer to the Undefined Behavior Manual for more information. |
arsenm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The set of peephole opt patches I'm working on improve some of these type of situations.
You also have a number of regressions in the VALU cases. Should do something about those (which may mean not doing this for VALU values). Also should consider v_mov_b64 on gfx940
| if (ISD::isConstantSplatVectorAllZeros(N)) | ||
| return true; | ||
|
|
||
| // Types may have legalized by stripping the 16 bit multi-element vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to adjust the legalizer to do this in the first place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in, allow vector types as long as they splat 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if they don't splat 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The splatness isn't significant, and if we're going to have 64-bit vector elements we should have made them legal throughout the DAG, not just during selection
| ; GFX9-NEXT: v_mov_b32_e32 v0, 0 | ||
| ; GFX9-NEXT: v_mov_b32_e32 v1, v0 | ||
| ; GFX9-NEXT: v_mov_b32_e32 v1, 0 | ||
| ; GFX9-NEXT: v_mov_b32_e32 v2, 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regression
| ; GFX9-NEXT: v_mov_b32_e32 v0, 0 | ||
| ; GFX9-NEXT: v_mov_b32_e32 v1, v0 | ||
| ; GFX9-NEXT: v_mov_b32_e32 v1, 0 | ||
| ; GFX9-NEXT: v_mov_b32_e32 v2, 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regression
Looking into it
Was going to be a follow up patch to fold the sgpr64 set 0 cases into vgpr64 in SIFoldOperand (perhaps should be peephole opt?) |
I am speculating that there are several things going on, some of which are going to be improved by a set of patches I'm working on which is taking significantly longer than I expected. Related to that, I was thinking about other ways we can improve the immediate folding. I've been working on enabling MachineCSE to work for subregister extracts, and improving folds of copies through sub registers. In doing this, I see a number of places where we get better coalescing and re-assemble materialize of 64-bit inline immediates. I've been thinking we should teach foldImmediate and/or SIFoldOperands to reconstruct 64-bit moves. e.g. foldImmediate can see the use instruction is a reg_sequence with the same register used twice, and replace it with an s_mov_b64. It also currently skips any constants with multiple uses, but the correct heuristic is probably more refined (like only skip multiple uses for non inline-immediate) |
This is exactly what would fix the aforementioned regressions. In the case prior to s_movb64 materialization, the MachineCSE eliminates the duplicate s_mov_b32 but it doesn't have the logic for subregister materialized immediate CSE'ing.
One of the reasons I tried to solve this in the selectiondag is because it has non-opaque and nicely described helpers for checking if there's a splatted vector. Additionally, my plan was to add s_mov_b64 -> v_mov_b64 folding (for gfx942+) in SIFoldOperands and I didn't have to fit in both s_mov_b32 x2 -> s_mov_b64 and s_mov_b64 -> v_mov_b64 in the same pass. Anywho, I'll check some of the mentioned alternatives cause this approach also requires a separate change for globalisel to do the same |
|
Rebase |
|
I've been looking at resolving the regression but having a hard time finding a good MachineCSE alternative. Tried to convince the complex pattern to associate the added Additionally, I bailed out of moving this selection code into |
| BuildMI(MBB, CopyMI, DL, TII->get(AMDGPU::V_ACCVGPR_WRITE_B32_e64), Tmp) | ||
| .addImm(Imm); | ||
| B.addReg(Tmp); | ||
| // Adding multiple immediates in case of 64 bit MOVs to agpr. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part is obsolete with #129463
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to rebase wrt your changes and add a similar split-b64-for-agpr
arsenm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do any of these test changes include the example you are trying to fix? I have several patches in flight which largely address similar issues
I've added https://github.com/llvm/llvm-project/pull/128017/files#diff-a073dcad4b3da7b790117b5218e1e7d7bd24eb963ac0d54b601f7f6fb001e453 Additionally, my follow up patch will fold |
arsenm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you separately pre-commit the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need -O3, probably
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use named values in tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add something to the name to indicate the value kinds? These are all uniform SGPR ptr + divergent mask
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need the comment (don't actually need the declarations either)
I've put up #129703 and yesterday's PR #129464 has already helped with removing the intermediate |
Any ideas for alternatives to implementing subreg awareness to MachineCSE for the regression of an additional vgpr use in cases like |
|
Rebase |
|
Shouldn't be resolved in selection; move b64 immediate materialization (through v_mov_b64, if available) to amdgpu fold mechanism. |
Zeros seem to materialize into a sequence of
s_mov_b32s, even whens_mov_b64is possible. First step to solving a worse case when 0 should be materialized into vector type vgprs as it first materializes into sgprs beforemoving into vgprs.